fix(duplication): prevent potential crash when removing or pausing duplication#2368
fix(duplication): prevent potential crash when removing or pausing duplication#2368lengyuexuexuan wants to merge 5 commits intoapache:masterfrom
Conversation
|
Hi @lengyuexuexuan Thank you for your contribution! Please modify the code according to the suggestions provided by Clang Tidy and IWYU. |
done |
| return; | ||
| } | ||
|
|
||
| zauto_lock l(_lock); |
There was a problem hiding this comment.
Does _primary_confirmed_decree below no longer need protection by _lock?
acelyc111-bot
left a comment
There was a problem hiding this comment.
Review: Prevent potential crash when removing/pausing duplication
Summary: Fixes race conditions in replica_duplicator_manager by adjusting lock placement. Also adds thread safety to raw_task callback clearing.
What's good:
- Moving lock acquisition earlier in
sync_duplicationprevents accessing shared state before protection - Removing the redundant lock in
update_confirmed_decree_if_secondaryis correct sinceremove_all_duplicationsnow acquires the lock itself - The
raw_taskdestructor +clear_non_trivial_on_task_endlock prevents use-after-free on the callback during concurrent task completion
Concerns:
-
Lock scope change in
sync_duplication— The lock now covers the entire function including the earlyreturnpath and all the logic below. Previously only the mutation of_duplicationswas locked. This is correct for safety but increases the lock hold time. Ifsync_duplicationis called frequently or does heavy work, this could cause contention. Worth checking if there are any blocking calls inside. -
Double lock possibility —
update_confirmed_decree_if_secondarycallsremove_all_duplications()which now acquires_lock. If any caller ofupdate_confirmed_decree_if_secondaryalready holds_lock, this would deadlock. Please verify there are no such call chains. -
raw_task lock naming —
_lockis very generic. Consider_cb_lockor_callback_lockfor clarity.
Verdict: update_confirmed_decree_if_secondary → remove_all_duplications.
What problem does this PR solve?
#2211
What is changed and how does it work?
Background
During duplication of a table, if the commands
dup remove/pauseare executed ora balance operationis performed at the same time, there is a chance that a node may core dump with signalID 11. The core dump locations vary, but they all have one thing in common: they occur during memory allocation or deallocation.Analysis
Based on extensive testing, the following conclusions can be drawn:
a. The issue only reproduces when there is write traffic. The difference between having and not having write traffic is: It adds the ship and load_private_log tasks.
b. The core dump occurs during the execution of
cancel_all().c. The issue occurs with low probability (approximately 1 in 100).
Through analysis using ASAN (AddressSanitizer):
dup_remove_asan.txt
Based on ASAN analysis, the following conclusions can be drawn:
a. The memory corruption occurs during the ship process. The mutations obtained from replaying the plog are passed to ship, leading to the issue.
b.
_load_mutationsis captured by a lambda expression and then passed to astd::function. Sincestd::moveis used, the lifetime of_load_mutationsis tied to that of thestd::function.c. The
cancel_all()function is executed in the default thread pool. At this point, the following function is called. When thestd::functionis set to nullptr, it will release the memory it manages.incubator-pegasus/src/task/task.h
Line 341 in e64faa7
d. However, each task executes
exec_internal()in its own thread pool, and eventually callsrelease_ref(), which results in delete this.incubator-pegasus/src/task/task.cpp
Line 224 in e64faa7
Conclusion
1. Both
task.cancel()andtask.exec_internal()destruct the std::function member inside the task object. These two operations are executed in different threads, and there is no mechanism in place to prevent race conditions between them. As a result, it is possible for both threads to attempt to destruct the same std::function, which can lead to a double deletion of the memory associated with _load_mutations. This ultimately causes memory corruption.2.
_duplicationsis accessed without proper synchronization in certain functions under multi-threaded scenarios, potentially causing race conditions.Solution
_cbcallback to ensure that only one thread executes its destructor._duplicationswithout synchronization to prevent concurrent access conflicts.Tests
The changes have been production-validated at Xiaomi, running stably on more than 30 clusters for over six months, confirming that they resolve the concurrency issues described above.